Dynamic Window GAM Analysis: Sunset Window (Max Count → Functional Sunset)

Author

Kyle Nessen

Published

October 2, 2025

1 Background and Rationale

This analysis addresses feedback from Francis regarding the temporal alignment of weather predictors with butterfly responses. The original daily-level analysis used fixed 6am-6pm weather windows, which Francis identified as having a temporal logic issue:

“All of the metrics for wind, temperature and light, would need to be re-calculated for 24 hour periods that begin at the time of the highest count.”

1.1 Francis’s Key Points

  1. Temporal alignment: Butterflies can only respond to weather after it occurs
  2. Biological timing: If max count occurred at 2pm on day t-1, the relevant weather window should start at 2pm, not 6am
  3. Roosting decisions: Weather from max count through sunset determines whether butterflies abandon the roost

1.2 This Analysis

Weather window: time_of_max_count_t_1last_observation_time_t (functional sunset)

  • Weather metrics now include overnight conditions (24/7 temperature and wind)
  • Window lengths vary by observation (mean: 29.6 hours, range: 22.5-34.9 hours)
  • Captures weather exposure until roosting decisions are finalized at dusk
  • Tests the hypothesis: “Do weather conditions from peak count to sunset predict roost abandonment?”

Data source: data/monarch_daily_lag_analysis_sunset_window.csv (103 lag pairs)


2 Setup

Code
library(tidyverse)
library(mgcv)
library(knitr)
library(here)
library(corrplot)
library(patchwork)
library(gratia)

# Load the sunset window data
daily_data <- read_csv(here("data", "monarch_daily_lag_analysis_sunset_window.csv"))

# Create the square root transformed response variable
daily_data <- daily_data %>%
    mutate(
        butterfly_diff_95th_sqrt = ifelse(butterfly_diff_95th >= 0,
            sqrt(butterfly_diff_95th),
            -sqrt(-butterfly_diff_95th)
        )
    )

3 Data Overview

Code
cat("Dataset dimensions:", nrow(daily_data), "rows x", ncol(daily_data), "columns\n")
Dataset dimensions: 103 rows x 52 columns
Code
cat("Number of deployments:", n_distinct(daily_data$deployment_id), "\n")
Number of deployments: 7 
Code
cat("Date range:", min(daily_data$date_t), "to", max(daily_data$date_t), "\n\n")
Date range: 19680 to 19756 
Code
cat("Window characteristics:\n")
Window characteristics:
Code
cat("- Mean lag duration:", round(mean(daily_data$lag_duration_hours, na.rm = TRUE), 2), "hours\n")
- Mean lag duration: 29.56 hours
Code
cat("- Duration range:", round(min(daily_data$lag_duration_hours, na.rm = TRUE), 2), "-",
    round(max(daily_data$lag_duration_hours, na.rm = TRUE), 2), "hours\n")
- Duration range: 22.5 - 34.9 hours
Code
cat("- Median data completeness:", round(median(daily_data$metrics_complete, na.rm = TRUE), 3), "\n\n")
- Median data completeness: 1 
Code
# Summary of key variables
summary_vars <- daily_data %>%
    select(
        butterflies_95th_percentile_t,
        butterflies_95th_percentile_t_1,
        butterfly_diff_95th,
        temp_max,
        temp_min,
        temp_at_max_count_t_1,
        wind_max_gust,
        sum_butterflies_direct_sun,
        lag_duration_hours,
        metrics_complete
    )

kable(summary(summary_vars), caption = "Summary statistics for key variables")
Summary statistics for key variables
butterflies_95th_percentile_t butterflies_95th_percentile_t_1 butterfly_diff_95th temp_max temp_min temp_at_max_count_t_1 wind_max_gust sum_butterflies_direct_sun lag_duration_hours metrics_complete
Min. : 0.00 Min. : 0.0 Min. :-310.000 Min. :14.00 Min. : 3.000 Min. : 5.00 Min. : 0.000 Min. : 0.0 Min. :22.50 Min. :0.0000
1st Qu.: 14.85 1st Qu.: 17.5 1st Qu.: -31.000 1st Qu.:17.00 1st Qu.: 6.000 1st Qu.:11.50 1st Qu.: 3.200 1st Qu.: 12.0 1st Qu.:25.50 1st Qu.:0.9996
Median : 70.05 Median : 77.0 Median : -2.950 Median :19.00 Median : 8.000 Median :14.00 Median : 4.100 Median : 42.0 Median :30.50 Median :0.9996
Mean :107.41 Mean :116.3 Mean : -8.919 Mean :20.47 Mean : 8.311 Mean :13.37 Mean : 4.391 Mean : 132.8 Mean :29.56 Mean :0.9593
3rd Qu.:166.95 3rd Qu.:199.5 3rd Qu.: 18.000 3rd Qu.:23.00 3rd Qu.:11.000 3rd Qu.:15.50 3rd Qu.: 5.100 3rd Qu.: 168.5 3rd Qu.:33.17 3rd Qu.:0.9997
Max. :499.00 Max. :499.0 Max. : 256.600 Max. :34.00 Max. :14.000 Max. :25.00 Max. :12.800 Max. :1122.0 Max. :34.90 Max. :0.9998
NA NA NA NA NA NA NA’s :2 NA NA NA

4 Response Variable Selection

We evaluate three butterfly difference metrics (max, 95th percentile, and top 3 mean) with three transformations each (untransformed, square root, and square) to determine which best approximates normality.

Code
# Filter to complete data for consistency with modeling
data_filtered <- daily_data %>% filter(metrics_complete >= 0.95)

# Define response variables
response_vars <- c(
    "butterfly_diff", "butterfly_diff_sqrt", "butterfly_diff_sq",
    "butterfly_diff_95th", "butterfly_diff_95th_sqrt", "butterfly_diff_95th_sq",
    "butterfly_diff_top3", "butterfly_diff_top3_sqrt", "butterfly_diff_top3_sq"
)

# Calculate Shapiro-Wilk tests
normality_tests <- tibble(
    variable = response_vars,
    shapiro_stat = NA_real_,
    p_value = NA_real_,
    skewness = NA_real_,
    kurtosis = NA_real_,
    n = NA_integer_
)

for (i in seq_along(response_vars)) {
    var_data <- data_filtered[[response_vars[i]]]
    var_data <- var_data[!is.na(var_data)]

    sw_test <- shapiro.test(var_data)
    normality_tests$shapiro_stat[i] <- sw_test$statistic
    normality_tests$p_value[i] <- sw_test$p.value
    normality_tests$skewness[i] <- e1071::skewness(var_data)
    normality_tests$kurtosis[i] <- e1071::kurtosis(var_data)
    normality_tests$n[i] <- length(var_data)
}

# Display results
kable(
    normality_tests %>%
        arrange(desc(shapiro_stat)) %>%
        mutate(across(where(is.numeric), ~round(.x, 4))),
    caption = "Shapiro-Wilk Normality Tests (sorted by W statistic)"
)
Shapiro-Wilk Normality Tests (sorted by W statistic)
variable shapiro_stat p_value skewness kurtosis n
butterfly_diff_sqrt 0.9893 0.6368 0.2251 -0.3143 96
butterfly_diff_95th_sqrt 0.9880 0.5411 0.0120 -0.6497 96
butterfly_diff_top3_sqrt 0.9875 0.5027 0.0266 -0.6207 96
butterfly_diff_95th 0.9182 0.0000 -0.3705 2.0466 96
butterfly_diff_top3 0.9126 0.0000 -0.0126 2.4706 96
butterfly_diff 0.8835 0.0000 0.3794 4.3938 96
butterfly_diff_95th_sq 0.6591 0.0000 -1.2037 9.1440 96
butterfly_diff_top3_sq 0.6327 0.0000 0.6220 10.0799 96
butterfly_diff_sq 0.5307 0.0000 1.9996 19.7056 96
Code
# Identify best
best_response <- normality_tests %>%
    filter(shapiro_stat == max(shapiro_stat)) %>%
    pull(variable)

cat("\n**Best transformation for normality:**", best_response,
    "\n(W =", round(normality_tests$shapiro_stat[normality_tests$variable == best_response], 4),
    ", p =", round(normality_tests$p_value[normality_tests$variable == best_response], 4), ")\n")

**Best transformation for normality:** butterfly_diff_sqrt 
(W = 0.9893 , p = 0.6368 )
Code
# Create histograms
par(mfrow = c(3, 3), mar = c(4, 4, 3, 1))
for (var in response_vars) {
    var_data <- data_filtered[[var]]
    var_data <- var_data[!is.na(var_data)]

    # Histogram
    hist(var_data, breaks = 30, probability = TRUE,
         main = sprintf("%s\n(W=%.3f, p=%.4f)",
                       var,
                       normality_tests$shapiro_stat[normality_tests$variable == var],
                       normality_tests$p_value[normality_tests$variable == var]),
         xlab = "Value", col = "steelblue", border = "black")

    # Overlay normal distribution
    x_seq <- seq(min(var_data), max(var_data), length.out = 100)
    lines(x_seq, dnorm(x_seq, mean(var_data), sd(var_data)),
          col = "red", lwd = 2)

    # Add grid
    grid()
}

4.1 Key Findings

  • Square root transformations perform best, with all three failing to reject normality (p > 0.5)
  • Untransformed differences show significant departures from normality (p < 0.0001)
  • Square transformations severely violate normality with extreme kurtosis (>9)
  • butterfly_diff_sqrt has the highest Shapiro-Wilk statistic (W = 0.9893, p = 0.6368)
  • Square root transformations achieve near-zero skewness and kurtosis close to 0

Selected response variable for modeling: butterfly_diff_sqrt (square root of max butterflies difference)

5 Candidate Predictor Variables

Following the original analysis structure, we consider all potential weather and baseline metrics, then use correlation analysis to select final predictors.

5.1 All Candidate Variables

Code
# Define all candidate predictors
butterfly_baseline_vars <- c(
    "butterflies_95th_percentile_t_1",
    "max_butterflies_t_1",
    "butterflies_top3_mean_t_1"
)

temperature_vars <- c(
    "temp_max",
    "temp_min",
    "temp_mean",
    "temp_at_max_count_t_1",
    "hours_above_15C",
    "degree_hours_above_15C"
)

wind_vars <- c(
    "wind_avg_sustained",
    "wind_max_gust",
    "wind_gust_sum",
    "wind_gust_sum_above_2ms",
    "wind_gust_hours",
    "wind_minutes_above_2ms",
    "wind_gust_sd",
    "wind_mode_gust"
)

sun_vars <- c(
    "sum_butterflies_direct_sun"
)

window_vars <- c(
    "lag_duration_hours"
)

# Combine all candidates
all_predictors <- c(
    butterfly_baseline_vars,
    temperature_vars,
    wind_vars,
    sun_vars,
    window_vars
)

cat("Total candidate predictors:", length(all_predictors), "\n")
Total candidate predictors: 19 
Code
cat("- Butterfly baseline:", length(butterfly_baseline_vars), "\n")
- Butterfly baseline: 3 
Code
cat("- Temperature:", length(temperature_vars), "\n")
- Temperature: 6 
Code
cat("- Wind:", length(wind_vars), "\n")
- Wind: 8 
Code
cat("- Sun exposure:", length(sun_vars), "\n")
- Sun exposure: 1 
Code
cat("- Window characteristics:", length(window_vars), "\n")
- Window characteristics: 1 

5.2 Variable Descriptions

Code
var_descriptions <- tribble(
    ~Variable, ~Description, ~Type,
    "butterflies_95th_percentile_t_1", "95th percentile count on previous day (baseline)", "Baseline",
    "max_butterflies_t_1", "Maximum count on previous day", "Baseline",
    "butterflies_top3_mean_t_1", "Mean of top 3 counts on previous day", "Baseline",
    "temp_max", "Max temp from max count to sunset (includes overnight)", "Temperature",
    "temp_min", "Min temp from max count to sunset (includes overnight)", "Temperature",
    "temp_mean", "Mean temp from max count to sunset", "Temperature",
    "temp_at_max_count_t_1", "Temperature when max count occurred", "Temperature",
    "hours_above_15C", "Hours ≥15°C in window", "Temperature",
    "degree_hours_above_15C", "Cumulative degree-hours >15°C", "Temperature",
    "wind_avg_sustained", "Mean sustained wind speed in window", "Wind",
    "wind_max_gust", "Maximum gust in window (includes overnight)", "Wind",
    "wind_gust_sum", "Sum of all gust measurements", "Wind",
    "wind_gust_sum_above_2ms", "Sum of gusts >2 m/s", "Wind",
    "wind_gust_hours", "Gust-hours (integral)", "Wind",
    "wind_minutes_above_2ms", "Minutes with wind ≥2 m/s", "Wind",
    "wind_gust_sd", "SD of gust speeds (variability)", "Wind",
    "wind_mode_gust", "Most frequent gust speed", "Wind",
    "sum_butterflies_direct_sun", "Sum of butterflies in direct sun (entire lag window)", "Sun",
    "lag_duration_hours", "Window length in hours", "Window"
)

kable(var_descriptions, caption = "Candidate predictor variables")
Candidate predictor variables
Variable Description Type
butterflies_95th_percentile_t_1 95th percentile count on previous day (baseline) Baseline
max_butterflies_t_1 Maximum count on previous day Baseline
butterflies_top3_mean_t_1 Mean of top 3 counts on previous day Baseline
temp_max Max temp from max count to sunset (includes overnight) Temperature
temp_min Min temp from max count to sunset (includes overnight) Temperature
temp_mean Mean temp from max count to sunset Temperature
temp_at_max_count_t_1 Temperature when max count occurred Temperature
hours_above_15C Hours <U+2265>15<U+00B0>C in window Temperature
degree_hours_above_15C Cumulative degree-hours >15<U+00B0>C Temperature
wind_avg_sustained Mean sustained wind speed in window Wind
wind_max_gust Maximum gust in window (includes overnight) Wind
wind_gust_sum Sum of all gust measurements Wind
wind_gust_sum_above_2ms Sum of gusts >2 m/s Wind
wind_gust_hours Gust-hours (integral) Wind
wind_minutes_above_2ms Minutes with wind <U+2265>2 m/s Wind
wind_gust_sd SD of gust speeds (variability) Wind
wind_mode_gust Most frequent gust speed Wind
sum_butterflies_direct_sun Sum of butterflies in direct sun (entire lag window) Sun
lag_duration_hours Window length in hours Window

6 Data Quality Assessment

Code
cat("Data completeness summary:\n")
Data completeness summary:
Code
cat("- Observations with metrics_complete = 0:", sum(daily_data$metrics_complete == 0), "\n")
- Observations with metrics_complete = 0: 2 
Code
cat("- Observations with wind_data_coverage < 0.5:", sum(daily_data$wind_data_coverage < 0.5), "\n")
- Observations with wind_data_coverage < 0.5: 5 
Code
cat("- Mean temperature coverage:", round(mean(daily_data$temp_data_coverage), 3), "\n")
- Mean temperature coverage: 1 
Code
cat("- Mean wind coverage:", round(mean(daily_data$wind_data_coverage), 3), "\n")
- Mean wind coverage: 0.952 
Code
cat("- Mean butterfly coverage:", round(mean(daily_data$butterfly_data_coverage), 3), "\n\n")
- Mean butterfly coverage: 0.989 
Code
# Show observations with low wind coverage
low_wind <- daily_data %>%
    filter(wind_data_coverage < 0.5) %>%
    select(deployment_id, date_t_1, date_t, wind_data_coverage,
           butterflies_95th_percentile_t_1, butterflies_95th_percentile_t)

if (nrow(low_wind) > 0) {
    cat("Observations with <50% wind coverage:\n")
    kable(low_wind,
          caption = "Low wind data coverage (likely wind database gaps, not zero-butterfly exclusions)",
          digits = 3)
} else {
    cat("All observations have adequate wind coverage\n")
}
Observations with <50% wind coverage:
Low wind data coverage (likely wind database gaps, not zero-butterfly exclusions)
deployment_id date_t_1 date_t wind_data_coverage butterflies_95th_percentile_t_1 butterflies_95th_percentile_t
SC10 2024-01-27 2024-01-28 0.185 0.0 6.70
SC10 2024-01-28 2024-01-29 0.043 6.7 14.70
SC10 2024-01-29 2024-01-30 0.000 14.7 9.00
SC10 2024-01-30 2024-01-31 0.000 9.0 7.95
SC9 2024-01-28 2024-01-29 0.040 18.3 1.00
Code
# Display the diagnostic plot created earlier
knitr::include_graphics(here("analysis", "dynamic_window_analysis", "data_quality_diagnostics.png"))

Note: 5 observations (all from late January 2024) have limited wind data coverage. These appear to be gaps in the wind database rather than issues with the data preparation logic. All observations have butterflies present.

6.1 Filtering Low Quality Observations

Code
# Filter observations with metrics_complete < 0.95
low_quality <- daily_data %>%
    filter(metrics_complete < 0.95)

cat("Observations to exclude (metrics_complete < 0.95):", nrow(low_quality), "\n")
Observations to exclude (metrics_complete < 0.95): 7 
Code
cat("Percentage of dataset:", round(nrow(low_quality) / nrow(daily_data) * 100, 1), "%\n\n")
Percentage of dataset: 6.8 %
Code
if (nrow(low_quality) > 0) {
    cat("Excluded observations have:\n")
    cat("- Mean butterflies_95th_t_1:", round(mean(low_quality$butterflies_95th_percentile_t_1), 1), "\n")
    cat("- Mean butterflies_95th_t:", round(mean(low_quality$butterflies_95th_percentile_t), 1), "\n")
    cat("- Mean |butterfly_diff_95th|:", round(mean(abs(low_quality$butterfly_diff_95th)), 1), "\n")
    cat("- Mean metrics_complete:", round(mean(low_quality$metrics_complete), 3), "\n\n")
}
Excluded observations have:
- Mean butterflies_95th_t_1: 22.9 
- Mean butterflies_95th_t: 19.2 
- Mean |butterfly_diff_95th|: 7.9 
- Mean metrics_complete: 0.45 
Code
# Apply filter
daily_data <- daily_data %>%
    filter(metrics_complete >= 0.95)

cat("After filtering:\n")
After filtering:
Code
cat("- Observations retained:", nrow(daily_data), "\n")
- Observations retained: 96 
Code
cat("- Mean butterflies_95th_t_1:", round(mean(daily_data$butterflies_95th_percentile_t_1), 1), "\n")
- Mean butterflies_95th_t_1: 123.1 
Code
cat("- Mean butterflies_95th_t:", round(mean(daily_data$butterflies_95th_percentile_t), 1), "\n")
- Mean butterflies_95th_t: 113.8 
Code
cat("- Mean |butterfly_diff_95th|:", round(mean(abs(daily_data$butterfly_diff_95th)), 1), "\n")
- Mean |butterfly_diff_95th|: 58 
Code
cat("- Mean metrics_complete:", round(mean(daily_data$metrics_complete), 3), "\n")
- Mean metrics_complete: 0.996 

Rationale for exclusion: The 7 excluded observations (6.8% of dataset) have relatively small butterfly counts (mean 95th percentile: 22.9 vs 123.1 for kept data) and incomplete weather data that could bias model estimates.

7 Correlation Matrix: All Candidate Predictors

This correlation matrix shows all potential fixed effects to help identify multicollinearity and guide variable selection.

Code
# Select all candidate predictors that exist in the dataset
available_predictors <- all_predictors[all_predictors %in% names(daily_data)]

# Create correlation matrix
predictor_data <- daily_data %>%
    select(all_of(available_predictors)) %>%
    na.omit()

cor_matrix <- cor(predictor_data)

# Create correlation plot
corrplot(cor_matrix,
    method = "color",
    type = "upper",
    order = "original",
    tl.cex = 0.8,
    tl.col = "black",
    tl.srt = 45,
    addCoef.col = "black",
    number.cex = 0.6,
    title = "Correlation Matrix: All Candidate Predictors (Sunset Window)",
    mar = c(0, 0, 2, 0)
)

Code
# Print correlation matrix as table
kable(round(cor_matrix, 3),
      caption = "Correlation coefficients for all candidate predictors")
Correlation coefficients for all candidate predictors
butterflies_95th_percentile_t_1 max_butterflies_t_1 butterflies_top3_mean_t_1 temp_max temp_min temp_mean temp_at_max_count_t_1 hours_above_15C degree_hours_above_15C wind_avg_sustained wind_max_gust wind_gust_sum wind_gust_sum_above_2ms wind_gust_hours wind_minutes_above_2ms wind_gust_sd wind_mode_gust sum_butterflies_direct_sun lag_duration_hours
butterflies_95th_percentile_t_1 1.000 0.968 0.995 -0.027 -0.294 -0.219 -0.124 -0.163 -0.014 -0.155 -0.306 -0.194 -0.220 -0.194 -0.233 -0.240 -0.110 0.381 -0.082
max_butterflies_t_1 0.968 1.000 0.982 -0.016 -0.293 -0.215 -0.076 -0.160 -0.014 -0.151 -0.307 -0.198 -0.207 -0.198 -0.216 -0.204 -0.129 0.480 -0.126
butterflies_top3_mean_t_1 0.995 0.982 1.000 -0.038 -0.302 -0.237 -0.118 -0.177 -0.028 -0.154 -0.310 -0.196 -0.219 -0.196 -0.226 -0.233 -0.117 0.406 -0.093
temp_max -0.027 -0.016 -0.038 1.000 0.056 0.711 0.139 0.606 0.919 -0.476 -0.218 -0.388 -0.317 -0.388 -0.364 -0.186 -0.427 -0.012 0.157
temp_min -0.294 -0.293 -0.302 0.056 1.000 0.639 0.366 0.365 0.070 0.107 0.232 0.168 0.156 0.168 0.167 0.118 0.240 -0.330 0.086
temp_mean -0.219 -0.215 -0.237 0.711 0.639 1.000 0.230 0.767 0.754 -0.178 0.064 -0.067 -0.030 -0.067 -0.059 0.016 -0.084 -0.177 0.297
temp_at_max_count_t_1 -0.124 -0.076 -0.118 0.139 0.366 0.230 1.000 0.138 0.039 -0.140 -0.105 -0.229 -0.177 -0.229 -0.177 -0.058 -0.101 -0.085 -0.574
hours_above_15C -0.163 -0.160 -0.177 0.606 0.365 0.767 0.138 1.000 0.694 -0.121 0.058 -0.006 0.003 -0.006 0.002 0.029 -0.096 -0.087 0.374
degree_hours_above_15C -0.014 -0.014 -0.028 0.919 0.070 0.754 0.039 0.694 1.000 -0.396 -0.196 -0.309 -0.269 -0.309 -0.294 -0.169 -0.340 -0.037 0.258
wind_avg_sustained -0.155 -0.151 -0.154 -0.476 0.107 -0.178 -0.140 -0.121 -0.396 1.000 0.740 0.956 0.917 0.956 0.938 0.586 0.824 0.096 0.142
wind_max_gust -0.306 -0.307 -0.310 -0.218 0.232 0.064 -0.105 0.058 -0.196 0.740 1.000 0.824 0.830 0.824 0.767 0.830 0.539 -0.159 0.314
wind_gust_sum -0.194 -0.198 -0.196 -0.388 0.168 -0.067 -0.229 -0.006 -0.309 0.956 0.824 1.000 0.953 1.000 0.963 0.648 0.800 0.006 0.372
wind_gust_sum_above_2ms -0.220 -0.207 -0.219 -0.317 0.156 -0.030 -0.177 0.003 -0.269 0.917 0.830 0.953 1.000 0.953 0.962 0.738 0.683 0.033 0.299
wind_gust_hours -0.194 -0.198 -0.196 -0.388 0.168 -0.067 -0.229 -0.006 -0.309 0.956 0.824 1.000 0.953 1.000 0.963 0.648 0.800 0.006 0.372
wind_minutes_above_2ms -0.233 -0.216 -0.226 -0.364 0.167 -0.059 -0.177 0.002 -0.294 0.938 0.767 0.963 0.962 0.963 1.000 0.630 0.743 0.053 0.300
wind_gust_sd -0.240 -0.204 -0.233 -0.186 0.118 0.016 -0.058 0.029 -0.169 0.586 0.830 0.648 0.738 0.648 0.630 1.000 0.226 -0.022 0.211
wind_mode_gust -0.110 -0.129 -0.117 -0.427 0.240 -0.084 -0.101 -0.096 -0.340 0.824 0.539 0.800 0.683 0.800 0.743 0.226 1.000 -0.039 0.153
sum_butterflies_direct_sun 0.381 0.480 0.406 -0.012 -0.330 -0.177 -0.085 -0.087 -0.037 0.096 -0.159 0.006 0.033 0.006 0.053 -0.022 -0.039 1.000 -0.116
lag_duration_hours -0.082 -0.126 -0.093 0.157 0.086 0.297 -0.574 0.374 0.258 0.142 0.314 0.372 0.299 0.372 0.300 0.211 0.153 -0.116 1.000

7.1 Highly Correlated Pairs (|r| > 0.7)

Code
# Find pairs with high correlation
high_cor_pairs <- data.frame()

for (i in 1:(nrow(cor_matrix)-1)) {
    for (j in (i+1):ncol(cor_matrix)) {
        cor_val <- cor_matrix[i, j]
        if (abs(cor_val) > 0.7) {
            high_cor_pairs <- rbind(
                high_cor_pairs,
                data.frame(
                    Var1 = rownames(cor_matrix)[i],
                    Var2 = colnames(cor_matrix)[j],
                    Correlation = round(cor_val, 3)
                )
            )
        }
    }
}

if (nrow(high_cor_pairs) > 0) {
    high_cor_pairs <- high_cor_pairs %>%
        arrange(desc(abs(Correlation)))

    kable(high_cor_pairs,
          caption = "Predictor pairs with |r| > 0.7 (potential multicollinearity issues)")
} else {
    cat("No predictor pairs with |r| > 0.7\n")
}
Predictor pairs with |r| > 0.7 (potential multicollinearity issues)
Var1 Var2 Correlation
wind_gust_sum wind_gust_hours 1.000
butterflies_95th_percentile_t_1 butterflies_top3_mean_t_1 0.995
max_butterflies_t_1 butterflies_top3_mean_t_1 0.982
butterflies_95th_percentile_t_1 max_butterflies_t_1 0.968
wind_gust_sum wind_minutes_above_2ms 0.963
wind_gust_hours wind_minutes_above_2ms 0.963
wind_gust_sum_above_2ms wind_minutes_above_2ms 0.962
wind_avg_sustained wind_gust_sum 0.956
wind_avg_sustained wind_gust_hours 0.956
wind_gust_sum wind_gust_sum_above_2ms 0.953
wind_gust_sum_above_2ms wind_gust_hours 0.953
wind_avg_sustained wind_minutes_above_2ms 0.938
temp_max degree_hours_above_15C 0.919
wind_avg_sustained wind_gust_sum_above_2ms 0.917
wind_max_gust wind_gust_sum_above_2ms 0.830
wind_max_gust wind_gust_sd 0.830
wind_avg_sustained wind_mode_gust 0.824
wind_max_gust wind_gust_sum 0.824
wind_max_gust wind_gust_hours 0.824
wind_gust_sum wind_mode_gust 0.800
wind_gust_hours wind_mode_gust 0.800
temp_mean hours_above_15C 0.767
wind_max_gust wind_minutes_above_2ms 0.767
temp_mean degree_hours_above_15C 0.754
wind_minutes_above_2ms wind_mode_gust 0.743
wind_avg_sustained wind_max_gust 0.740
wind_gust_sum_above_2ms wind_gust_sd 0.738
temp_max temp_mean 0.711

8 Model Building Strategy

Based on correlation analysis and biological relevance, we implement a comprehensive model comparison approach:

8.1 Selected Fixed Effects

Always included (in all models): - max_butterflies_t_1: Baseline count (tested both as smooth and linear) - lag_duration_hours: Window duration control (tested both as smooth and linear)

Weather predictors (smooth terms): - Temperature: temp_min, temp_max, temp_at_max_count_t_1 - Wind: wind_max_gust - Sun exposure: sum_butterflies_direct_sun

Response variable: butterfly_diff_sqrt

Random structure (constant across all models): - Deployment random effect: s(deployment_id, bs="re") - AR1 temporal autocorrelation: correlation = corAR1(form = ~ observation_order_t | deployment_id)

8.2 Model Complexity Levels

We fit models ranging from simple (single predictors) to complex (all terms + interactions):

  1. Null model: Baseline + controls only
  2. Single predictor models: One weather variable at a time
  3. Additive models: Multiple predictors, no interactions
  4. Two-way interaction models: Selected biologically relevant interactions
  5. Full model: All terms + key interactions

8.3 Functional Form Comparison

For max_butterflies_t_1 and lag_duration_hours, we compare: - Version A: Both as smooth terms s(...) - Version B: Both as linear terms

AIC will determine optimal functional form.

8.4 Model Naming Convention

Models named: m[complexity]_[version] - Complexity: null, temp, wind, sun, additive, interact, full - Version: smooth or linear (for baseline/lag functional form)

Code
# Load required library for AR1
library(nlme)

# Prepare data
model_data <- daily_data %>%
    filter(metrics_complete >= 0.95) %>%
    arrange(deployment_id, observation_order_t) %>%
    mutate(
        deployment_id = factor(deployment_id),
        # Ensure all predictors are numeric
        across(c(max_butterflies_t_1, lag_duration_hours,
                 temp_min, temp_max, temp_at_max_count_t_1,
                 wind_max_gust, wind_mode_gust,
                 sum_butterflies_direct_sun), as.numeric)
    ) %>%
    # Remove any rows with NA in key variables
    filter(!is.na(butterfly_diff_sqrt),
           !is.na(max_butterflies_t_1),
           !is.na(lag_duration_hours))

cat("Model data:", nrow(model_data), "observations\n")
Model data: 96 observations
Code
cat("Deployments:", n_distinct(model_data$deployment_id), "\n")
Deployments: 6 

9 Model Fitting

Code
# Initialize model list
models <- list()

# AR1 correlation structure
ar1_cor <- corAR1(form = ~ observation_order_t | deployment_id)

# Set k values based on unique values in data
# lag_duration_hours has 37 unique values, k=5 is conservative
k_baseline <- 5
k_lag <- 5

# ============================================================================
# SMOOTH VERSION: max_butterflies_t_1 and lag_duration_hours as smooths
# ============================================================================

# Null model (controls only)
models$null_smooth <- gamm(
    butterfly_diff_sqrt ~ s(max_butterflies_t_1, k = 5) + s(lag_duration_hours, k = 5) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

# Single temperature predictors
models$temp_min_smooth <- gamm(
    butterfly_diff_sqrt ~ s(max_butterflies_t_1, k = k_baseline) + s(lag_duration_hours, k = k_lag) +
                          s(temp_min) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

models$temp_max_smooth <- gamm(
    butterfly_diff_sqrt ~ s(max_butterflies_t_1, k = k_baseline) + s(lag_duration_hours, k = k_lag) +
                          s(temp_max) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

models$temp_at_max_smooth <- gamm(
    butterfly_diff_sqrt ~ s(max_butterflies_t_1, k = k_baseline) + s(lag_duration_hours, k = k_lag) +
                          s(temp_at_max_count_t_1) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

# Single wind predictors
models$wind_max_smooth <- gamm(
    butterfly_diff_sqrt ~ s(max_butterflies_t_1, k = k_baseline) + s(lag_duration_hours, k = k_lag) +
                          s(wind_max_gust) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

# Sun exposure
models$sun_smooth <- gamm(
    butterfly_diff_sqrt ~ s(max_butterflies_t_1, k = k_baseline) + s(lag_duration_hours, k = k_lag) +
                          s(sum_butterflies_direct_sun) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

# Additive: all temperature
models$temp_additive_smooth <- gamm(
    butterfly_diff_sqrt ~ s(max_butterflies_t_1, k = k_baseline) + s(lag_duration_hours, k = k_lag) +
                          s(temp_min) + s(temp_max) + s(temp_at_max_count_t_1) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

# Additive: temp + wind
models$temp_wind_smooth <- gamm(
    butterfly_diff_sqrt ~ s(max_butterflies_t_1, k = k_baseline) + s(lag_duration_hours, k = k_lag) +
                          s(temp_min) + s(temp_max) + s(temp_at_max_count_t_1) +
                          s(wind_max_gust) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

# Additive: temp + wind + sun
models$full_additive_smooth <- gamm(
    butterfly_diff_sqrt ~ s(max_butterflies_t_1, k = k_baseline) + s(lag_duration_hours, k = k_lag) +
                          s(temp_min) + s(temp_max) + s(temp_at_max_count_t_1) +
                          s(wind_max_gust) +
                          s(sum_butterflies_direct_sun) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

# Interactions: temp x wind (biologically relevant)
models$temp_wind_interact_smooth <- gamm(
    butterfly_diff_sqrt ~ s(max_butterflies_t_1, k = k_baseline) + s(lag_duration_hours, k = k_lag) +
                          s(temp_min) + s(temp_max) + s(temp_at_max_count_t_1) +
                          s(wind_max_gust) +
                          ti(temp_min, wind_max_gust) +
                          ti(temp_max, wind_max_gust) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

# Full interaction model with sun
models$full_interact_smooth <- gamm(
    butterfly_diff_sqrt ~ s(max_butterflies_t_1, k = k_baseline) + s(lag_duration_hours, k = k_lag) +
                          s(temp_min) + s(temp_max) + s(temp_at_max_count_t_1) +
                          s(wind_max_gust) +
                          s(sum_butterflies_direct_sun) +
                          ti(temp_min, wind_max_gust) +
                          ti(temp_max, wind_max_gust) +
                          ti(temp_min, sum_butterflies_direct_sun) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

# ============================================================================
# LINEAR VERSION: max_butterflies_t_1 and lag_duration_hours as linear
# ============================================================================

# Null model (controls only)
models$null_linear <- gamm(
    butterfly_diff_sqrt ~ max_butterflies_t_1 + lag_duration_hours +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

# Single temperature predictors
models$temp_min_linear <- gamm(
    butterfly_diff_sqrt ~ max_butterflies_t_1 + lag_duration_hours +
                          s(temp_min) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

models$temp_max_linear <- gamm(
    butterfly_diff_sqrt ~ max_butterflies_t_1 + lag_duration_hours +
                          s(temp_max) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

models$temp_at_max_linear <- gamm(
    butterfly_diff_sqrt ~ max_butterflies_t_1 + lag_duration_hours +
                          s(temp_at_max_count_t_1) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

# Single wind predictors
models$wind_max_linear <- gamm(
    butterfly_diff_sqrt ~ max_butterflies_t_1 + lag_duration_hours +
                          s(wind_max_gust) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

# Sun exposure
models$sun_linear <- gamm(
    butterfly_diff_sqrt ~ max_butterflies_t_1 + lag_duration_hours +
                          s(sum_butterflies_direct_sun) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

# Additive: all temperature
models$temp_additive_linear <- gamm(
    butterfly_diff_sqrt ~ max_butterflies_t_1 + lag_duration_hours +
                          s(temp_min) + s(temp_max) + s(temp_at_max_count_t_1) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

# Additive: temp + wind
models$temp_wind_linear <- gamm(
    butterfly_diff_sqrt ~ max_butterflies_t_1 + lag_duration_hours +
                          s(temp_min) + s(temp_max) + s(temp_at_max_count_t_1) +
                          s(wind_max_gust) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

# Additive: temp + wind + sun
models$full_additive_linear <- gamm(
    butterfly_diff_sqrt ~ max_butterflies_t_1 + lag_duration_hours +
                          s(temp_min) + s(temp_max) + s(temp_at_max_count_t_1) +
                          s(wind_max_gust) +
                          s(sum_butterflies_direct_sun) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

# Interactions: temp x wind
models$temp_wind_interact_linear <- gamm(
    butterfly_diff_sqrt ~ max_butterflies_t_1 + lag_duration_hours +
                          s(temp_min) + s(temp_max) + s(temp_at_max_count_t_1) +
                          s(wind_max_gust) +
                          ti(temp_min, wind_max_gust) +
                          ti(temp_max, wind_max_gust) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

# Full interaction model with sun
models$full_interact_linear <- gamm(
    butterfly_diff_sqrt ~ max_butterflies_t_1 + lag_duration_hours +
                          s(temp_min) + s(temp_max) + s(temp_at_max_count_t_1) +
                          s(wind_max_gust) +
                          s(sum_butterflies_direct_sun) +
                          ti(temp_min, wind_max_gust) +
                          ti(temp_max, wind_max_gust) +
                          ti(temp_min, sum_butterflies_direct_sun) +
                          s(deployment_id, bs = "re"),
    data = model_data,
    correlation = ar1_cor,
    method = "REML"
)

cat("Total models fitted:", length(models), "\n")
Total models fitted: 22 

10 Model Comparison

Code
# Extract AIC for all models
model_comparison <- tibble(
    model = names(models),
    AIC = sapply(models, function(m) AIC(m$lme)),
    BIC = sapply(models, function(m) BIC(m$lme)),
    logLik = sapply(models, function(m) as.numeric(logLik(m$lme))),
    df = sapply(models, function(m) attr(logLik(m$lme), "df"))
) %>%
    mutate(
        delta_AIC = AIC - min(AIC),
        weight = exp(-0.5 * delta_AIC) / sum(exp(-0.5 * delta_AIC))
    ) %>%
    arrange(AIC)

# Display full comparison table
kable(
    model_comparison %>%
        mutate(across(c(AIC, BIC, logLik), ~round(.x, 2)),
               across(c(delta_AIC, weight), ~round(.x, 4))),
    caption = "Model comparison ranked by AIC"
)
Model comparison ranked by AIC
model AIC BIC logLik df delta_AIC weight
full_interact_smooth 633.65 699.60 -289.82 27 0.0000 0.9751
full_interact_linear 641.66 702.72 -295.83 25 8.0065 0.0178
temp_wind_interact_smooth 644.36 698.61 -300.18 22 10.7076 0.0046
temp_max_smooth 646.87 672.09 -313.43 10 13.2191 0.0013
sun_smooth 650.23 675.45 -315.12 10 16.5854 0.0002
full_additive_smooth 650.32 694.91 -307.16 18 16.6734 0.0002
temp_additive_smooth 650.85 685.85 -311.43 14 17.2042 0.0002
temp_at_max_smooth 651.32 676.54 -315.66 10 17.6686 0.0001
null_smooth 651.52 671.78 -317.76 8 17.8669 0.0001
temp_wind_smooth 652.41 692.23 -310.21 16 18.7611 0.0001
temp_min_smooth 652.85 678.07 -316.42 10 19.1985 0.0001
wind_max_smooth 652.93 678.15 -316.47 10 19.2822 0.0001
temp_wind_interact_linear 653.28 702.60 -306.64 20 19.6303 0.0001
temp_max_linear 656.44 676.61 -320.22 8 22.7867 0.0000
temp_additive_linear 659.84 689.84 -317.92 12 26.1913 0.0000
null_linear 660.69 675.88 -324.34 6 27.0375 0.0000
sun_linear 660.69 680.87 -322.35 8 27.0433 0.0000
full_additive_linear 661.12 700.76 -314.56 16 27.4696 0.0000
temp_wind_linear 661.64 696.49 -316.82 14 27.9953 0.0000
temp_min_linear 661.67 681.84 -322.84 8 28.0216 0.0000
wind_max_linear 662.14 682.31 -323.07 8 28.4879 0.0000
temp_at_max_linear 662.72 682.89 -323.36 8 29.0676 0.0000
Code
# Highlight top 5 models
cat("\n=== TOP 5 MODELS ===\n\n")

=== TOP 5 MODELS ===
Code
top_5 <- model_comparison %>% head(5)
for (i in 1:nrow(top_5)) {
    cat(sprintf("%d. %s (AIC=%.2f, ΔAIC=%.2f, weight=%.4f)\n",
                i, top_5$model[i], top_5$AIC[i], top_5$delta_AIC[i], top_5$weight[i]))
}
1. full_interact_smooth (AIC=633.65, <U+0394>AIC=0.00, weight=0.9751)
2. full_interact_linear (AIC=641.66, <U+0394>AIC=8.01, weight=0.0178)
3. temp_wind_interact_smooth (AIC=644.36, <U+0394>AIC=10.71, weight=0.0046)
4. temp_max_smooth (AIC=646.87, <U+0394>AIC=13.22, weight=0.0013)
5. sun_smooth (AIC=650.23, <U+0394>AIC=16.59, weight=0.0002)

11 Best Model Summary

Code
# Extract best model
best_model_name <- model_comparison$model[1]
best_model <- models[[best_model_name]]

cat("=== BEST MODEL:", best_model_name, "===\n\n")
=== BEST MODEL: full_interact_smooth ===
Code
# Summary of GAM component
summary(best_model$gam)

Family: gaussian 
Link function: identity 

Formula:
butterfly_diff_sqrt ~ s(max_butterflies_t_1, k = k_baseline) + 
    s(lag_duration_hours, k = k_lag) + s(temp_min) + s(temp_max) + 
    s(temp_at_max_count_t_1) + s(wind_max_gust) + s(sum_butterflies_direct_sun) + 
    ti(temp_min, wind_max_gust) + ti(temp_max, wind_max_gust) + 
    ti(temp_min, sum_butterflies_direct_sun) + s(deployment_id, 
    bs = "re")

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.5041     1.1641  -0.433    0.666

Approximate significance of smooth terms:
                                          edf Ref.df      F  p-value    
s(max_butterflies_t_1)                  1.000  1.000 29.219 8.46e-07 ***
s(lag_duration_hours)                   1.000  1.000  1.996  0.16178    
s(temp_min)                             1.000  1.000  0.080  0.77849    
s(temp_max)                             2.511  2.511  2.366  0.07524 .  
s(temp_at_max_count_t_1)                1.000  1.000  0.025  0.87570    
s(wind_max_gust)                        1.648  1.648  1.431  0.15986    
s(sum_butterflies_direct_sun)           1.776  1.776  2.790  0.03982 *  
ti(temp_min,wind_max_gust)              2.136  2.136  1.544  0.17318    
ti(temp_max,wind_max_gust)              1.000  1.000  6.789  0.01100 *  
ti(temp_min,sum_butterflies_direct_sun) 3.165  3.165  5.485  0.00112 ** 
s(deployment_id)                        2.051  6.000  0.679  0.08775 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.529   
  Scale est. = 32.824    n = 96
Code
# Summary of LME component (shows AR1 parameters)
cat("\n=== CORRELATION STRUCTURE ===\n")

=== CORRELATION STRUCTURE ===
Code
summary(best_model$lme$modelStruct$corStruct)
Correlation Structure: ARMA(1,0)
 Formula: ~observation_order_t | g/g.0/g.1/g.2/g.3/g.4/g.5/g.6/g.7/g.8/g.9/deployment_id 
 Parameter estimate(s):
     Phi1 
0.1375867 

12 Partial Effects Plots (Best Model)

Code
# Plot smooth terms from best model
draw(best_model$gam, residuals = TRUE)

13 Model Diagnostics (Best Model)

13.1 Residual Diagnostics

Code
# Extract residuals
model_resid <- residuals(best_model$lme, type = "normalized")
model_fitted <- fitted(best_model$lme)

# Create diagnostic plots
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))

# 1. Residuals vs Fitted
plot(model_fitted, model_resid,
     xlab = "Fitted values", ylab = "Normalized residuals",
     main = "Residuals vs Fitted",
     pch = 19, col = rgb(0, 0, 0, 0.5))
abline(h = 0, col = "red", lwd = 2, lty = 2)
lines(lowess(model_fitted, model_resid), col = "blue", lwd = 2)

# 2. Q-Q plot
qqnorm(model_resid, main = "Normal Q-Q Plot",
       pch = 19, col = rgb(0, 0, 0, 0.5))
qqline(model_resid, col = "red", lwd = 2)

# 3. Scale-Location
sqrt_abs_resid <- sqrt(abs(model_resid))
plot(model_fitted, sqrt_abs_resid,
     xlab = "Fitted values", ylab = "√|Normalized residuals|",
     main = "Scale-Location",
     pch = 19, col = rgb(0, 0, 0, 0.5))
lines(lowess(model_fitted, sqrt_abs_resid), col = "blue", lwd = 2)

# 4. Residuals vs Order (temporal autocorrelation check)
plot(seq_along(model_resid), model_resid,
     xlab = "Observation order", ylab = "Normalized residuals",
     main = "Residuals vs Order",
     pch = 19, col = rgb(0, 0, 0, 0.5))
abline(h = 0, col = "red", lwd = 2, lty = 2)
lines(lowess(seq_along(model_resid), model_resid), col = "blue", lwd = 2)

13.2 Normality Tests

Code
# Shapiro-Wilk test on residuals
shapiro_test <- shapiro.test(model_resid)

cat("Shapiro-Wilk Normality Test on Residuals\n")
Shapiro-Wilk Normality Test on Residuals
Code
cat("=========================================\n")
=========================================
Code
cat("W statistic:", round(shapiro_test$statistic, 4), "\n")
W statistic: 0.9852 
Code
cat("p-value:", round(shapiro_test$p.value, 4), "\n")
p-value: 0.3553 
Code
if (shapiro_test$p.value > 0.05) {
    cat("Result: Cannot reject normality (p > 0.05)\n")
} else {
    cat("Result: Residuals deviate from normality (p < 0.05)\n")
}
Result: Cannot reject normality (p > 0.05)
Code
cat("\nSkewness:", round(e1071::skewness(model_resid), 3), "\n")

Skewness: 0.214 
Code
cat("Kurtosis:", round(e1071::kurtosis(model_resid), 3), "\n")
Kurtosis: -0.442 

13.3 Autocorrelation Diagnostics

Code
par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

# ACF plot
acf(model_resid, main = "ACF of Normalized Residuals", lag.max = 20)

# PACF plot
pacf(model_resid, main = "PACF of Normalized Residuals", lag.max = 20)

13.4 Influence Diagnostics

Code
# Leverage and standardized residuals as influence measures
# For mixed models, we use standardized residuals and leverage-like measures

# Standardized residuals
std_resid <- model_resid  # Already normalized from lme

# Identify potential outliers (|std resid| > 2.5)
outlier_threshold <- 2.5
outliers <- abs(std_resid) > outlier_threshold
n_influential <- sum(outliers)

par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))

# Standardized residuals index plot
plot(seq_along(std_resid), std_resid, type = "h",
     xlab = "Observation", ylab = "Standardized Residual",
     main = "Standardized Residuals",
     col = ifelse(outliers, "red", "black"))
abline(h = c(-outlier_threshold, 0, outlier_threshold),
       col = c("red", "gray", "red"), lty = c(2, 1, 2))

cat("\nInfluential Observations (|std resid| > 2.5):\n")

Influential Observations (|std resid| > 2.5):
Code
cat("Threshold: ±", outlier_threshold, "\n")
Threshold: <U+00B1> 2.5 
Code
cat("Number of potential outliers:", n_influential, "\n")
Number of potential outliers: 0 
Code
if (n_influential > 0) {
    influential_obs <- which(outliers)
    cat("Observation indices:", paste(influential_obs, collapse = ", "), "\n")
    cat("Residual values:", paste(round(std_resid[outliers], 2), collapse = ", "), "\n")
}

# Histogram of standardized residuals
hist(std_resid, breaks = 30,
     xlab = "Standardized Residual",
     main = "Distribution of Standardized Residuals",
     col = "steelblue", border = "black")
abline(v = c(-outlier_threshold, outlier_threshold),
       col = "red", lwd = 2, lty = 2)

13.5 GAM-Specific Diagnostics

Code
# mgcv's built-in diagnostic plots
gam.check(best_model$gam)


'gamm' based fit - care required with interpretation.
Checks based on working residuals may be misleading.
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

                                           k'   edf k-index p-value
s(max_butterflies_t_1)                   4.00  1.00    1.03    0.56
s(lag_duration_hours)                    4.00  1.00    1.17    0.93
s(temp_min)                              9.00  1.00    0.92    0.15
s(temp_max)                              9.00  2.51    1.14    0.93
s(temp_at_max_count_t_1)                 9.00  1.00    1.01    0.55
s(wind_max_gust)                         9.00  1.65    1.02    0.56
s(sum_butterflies_direct_sun)            9.00  1.78    1.09    0.79
ti(temp_min,wind_max_gust)              16.00  2.14    1.11    0.87
ti(temp_max,wind_max_gust)              16.00  1.00    1.02    0.58
ti(temp_min,sum_butterflies_direct_sun) 16.00  3.16    1.11    0.84
s(deployment_id)                         6.00  2.05      NA      NA

13.6 Summary of Diagnostic Checks

Code
cat("\n=== DIAGNOSTIC SUMMARY ===\n\n")

=== DIAGNOSTIC SUMMARY ===
Code
# 1. Normality
cat("1. NORMALITY OF RESIDUALS\n")
1. NORMALITY OF RESIDUALS
Code
cat("   Shapiro-Wilk p-value:", round(shapiro_test$p.value, 4), "\n")
   Shapiro-Wilk p-value: 0.3553 
Code
if (shapiro_test$p.value > 0.05) {
    cat("   ✓ Residuals are approximately normal\n\n")
} else {
    cat("   ⚠ Some deviation from normality detected\n\n")
}
   <U+2713> Residuals are approximately normal
Code
# 2. Homoscedasticity (visual check from scale-location plot)
cat("2. HOMOSCEDASTICITY\n")
2. HOMOSCEDASTICITY
Code
cat("   Check Scale-Location plot above\n")
   Check Scale-Location plot above
Code
cat("   Look for horizontal trend line (constant variance)\n\n")
   Look for horizontal trend line (constant variance)
Code
# 3. Autocorrelation
cat("3. TEMPORAL AUTOCORRELATION\n")
3. TEMPORAL AUTOCORRELATION
Code
cat("   AR1 parameter:", round(coef(best_model$lme$modelStruct$corStruct, unconstrained = FALSE), 3), "\n")
   AR1 parameter: 0.138 
Code
cat("   Check ACF/PACF plots for remaining autocorrelation\n\n")
   Check ACF/PACF plots for remaining autocorrelation
Code
# 4. Influential points
cat("4. INFLUENTIAL OBSERVATIONS\n")
4. INFLUENTIAL OBSERVATIONS
Code
cat("   Number of potential outliers:", n_influential, "/", length(model_resid), "\n")
   Number of potential outliers: 0 / 96 
Code
if (n_influential == 0) {
    cat("   ✓ No potential outliers (|std resid| > 2.5)\n\n")
} else {
    cat("   ⚠", n_influential, "observations may be outliers\n\n")
}
   <U+2713> No potential outliers (|std resid| > 2.5)
Code
# 5. Basis dimension check
cat("5. BASIS DIMENSION ADEQUACY\n")
5. BASIS DIMENSION ADEQUACY
Code
cat("   See k-index values in gam.check() output above\n")
   See k-index values in gam.check() output above
Code
cat("   k-index > 1.0 and p-value > 0.05 indicate adequate basis dimensions\n")
   k-index > 1.0 and p-value > 0.05 indicate adequate basis dimensions

Model fitting complete. AIC-based selection identifies optimal model structure. Diagnostics confirm model assumptions.